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ABSTRACT 


A spectrum analyzer based on Fast Fourier Transform (FFT) techniques was 
implemented using the TMS320C6201 Digital Signal Processor device manufactured by 
Texas Instruments. Portable C programs demonstrated optimization of the FFT algorithm 
for maximum speed on a general-purpose processor. Previously published algorithms 
were then adapted to the unique features of this Very-Long Instruction Word (VLIW) par- 
allel processor and and performance requirements of this application, taking into account 
fixed-point arithmetic, parallel operation of functional units, and a hierarchy of memory 
capacities and speeds. The effectiveness of the VLIW C compiler, with automatic optimi- 
zation, is compared with an explicitly-scheduled assembly-language program. The result- 
ing program was then used to demonstrate the crucial need to keep program data in the 


Internal Data Memory to preserve hard-won performance gains. 
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EXECUTIVE SUMMARY 


The problem: Monitor the radio-frequency environment in a region of interest for unau- 
thorized transmissions from unknown sources given constraints on system size and cost. 


The approach: Design a spectrum analyzer based on the Fast Fourier Transform (FFT), 
and implement the FFT as efficiently as possible. Compare general-purpose computers 
(SGI R4000 and Intel Pentium-3) with a specialized Digital Signal Processor (DSP, the 
Texas Instruments TMS320C6201), deriving efficient algorithms for this specific applica- 
tion. 


The major results: 


1. 16-bit integer arithmetic was found to be adequate for implementation of the FFT in this 
application, as long as partial results are appropriately scaled to prevent arithmetic over- 
flow. (An integer FFT algorithm which includes scaling of partial results is original work.) 


2. The VonHann window function, used in the Welch method of averaged periodograms 
for spectrum estimation, provides adequate spectrum estimation accuracy (significantly 
better than the popular Hamming window), and can be derived as needed from the tabu- 
lated sine and cosine factors used in the FFT. 


3. For FFT data sets which cannot fit in the cache of a general-purpose computer (or in the 
Internal Data RAM of the DSP) relying on automatic memory management to provide 
data to the FFT leads to a dramatic increase in the run time. When computing a 1048576- 
point transform on the RISC processor, for example, the processor is idle waiting for 
cache updates 80% of the time. An algorithm which factors 1048576 into 1024 transforms 
of 1024 points each recovers most of this idle time, running in less than one third the time 
of the original algorithm. While this algorithm has been published in Fortran, the ANSI-C 
implementation (Appendix, part C) is original work. The comparison between actual run 
times and run times extrapolated from small data sets sizes, to assess cache effectiveness, 
is also original. 


4. The algorithm which factors a large transform into a sequence of smaller transforms 
suggests a scheme for computing a large transform on a massively parallel processor, 
though this was not implemented. This scheme would be particularly useful when samples 
are acquired at a rate which exceeds the input bandwidth of any single processor’s mem- 
ory. 


Xill 


5. The “factored FFT” algorithm relies on a matrix-transposition routine which also allows 
efficient management of the processor cache. This ANSI-C program (Appendix, part B) is 
original work. 


6. Software development for the DSP environment was found to be more difficult than for 
the general-purpose computing environment, as described below. 


Starting with a published FFT algorithm, improvements to the ANSI-C source code 
(Appendix, part A) reduced the run-time for a 4096-point FFT on the R4000 from 23 msec 
to 11 msec . After converting from floating-point to integer arithmetic, the function ran in 
12 msec on the DSP with all compiler optimizations disabled. Enabling all compiler opti- 
mizations reduced the run time to 1.4 msec, yet the expert-optimized assembly language 
version ran in just 0.40 msec. We interpret this to mean that approaching the advertised 
performance on complex algorithms requires expert programming at the assembly lan- 
guage level. (Such algorithms may be provided in off-the-shelf libraries, though.) Modifi- 
cations to the optimized assembly language to perform intermediate result scaling is 
original work. The use of a “financial spreadsheet” (e.g., Excel, Gnumeric) for scheduling 
parallel processor operations is also original work. 


On the Pentium-3, portable ANSI-C code ran in 0.79 msec, and Intel’s optimized library 
code ran in 0.32 msec. We interpret this to mean that Intel’s C compiler comes closer to 
achieving peak performance than the DSP compiler does. (Note that the 100 MHz R4000 
processor is at least five years older than the 733 MHz Pentium-3; this is not “a fair race” 
in absolute terms.) 


7. Comparing general-purpose processors, we find that the Pentium-3 system is at least 
twice as fast as the RISC R4000 after compensating for the difference in clock rates, 
which we attribute to architectural differences. The Pentium-3 is over four times as fast for 
the 1048576-point transform, which reflects a faster memory system. 


8. Comparing the Pentium-3 and the DSP, we find that the Pentium-3 was slightly faster in 
completing a 4096-point floating-point transform than the DSP was in computing the inte- 
ger transform. Though the Pentium-3 processor is more expensive than the DSP, the enor- 
mous volume of systems which incorporate the Pentium-3 has driven the system price 
(Compaq Proliant) to a fraction of the price of the DSP system (Pentek 4290). On the 
other hand, specialized signal processing peripheral devices which are required to provide 
sampled data to the processor (e.g., a radio receiver with an 8 MegSample/sec output path) 
are simply unavailable for the Intel architecture. 


Though the DSP has a clear advantage in high-volume markets for highly-integrated sys- 
tems (e.g., modems), developers of unique systems for niche markets must carefully eval- 


X1V 


uate the current state of commercial products in the context of their application to get the 
best configuration. 


Significance: The work described in this thesis has advanced our development of two 


spectrum analysis instruments. This thesis may provide useful guidance to others, espe- 
cially to those working with FFTs of more than 65536 data points. 
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I. INTRODUCTION 


The Fourier Transform is one of the fundamental tools of electrical engineering. Its 
sampled-data version, the Discrete Fourier Transform (DFT), is crucial to several areas of 
digital signal processing (DSP). It is used for digital filter design and implementation, 
time-delay estimation, image compression coding, and spectrum analysis. 

Spectrum analysis is the context of this thesis. Spectrum conflict management, sig- 
nals intelligence, technical security, space probe telemetry, and the (thus far incomplete) 
Search for ExtraTerrestrial Intelligence (SETD all rely on detecting the presence of radio 
signals of unknown frequency, power, and modulation. 

Though the general topic of spectrum estimation is still an area of research, the 
work described in this thesis considers only Welch’s method of averaged modified peri- 
odograms using the Fast Fourier Transform (FFT) [Ref. 1: p.553]. Since the populariza- 
tion of the FFT algorithm for computing the DFT by J. W. Cooley and J. W. Tukey in the 
mid-1960s, numerous researchers have studied ways to compute it as quickly as possible 
with the technology of the time. Innovations in computer architecture have enabled evolu- 
tion of FFT algorithms. 

Digital signal processing (DSP) can be done with general purpose computers, but 
computer architectures optimized for digital signal processing have been implemented in 
microprocessor form for almost two decades. Most DSP algorithms rely heavily on multi- 
plication and addition, so early DSP devices dedicated a substantial fraction of their chip 
area to single-cycle multiplication hardware. General-purpose microprocessors of that 
time performed multiplication through shifting and adding (sometimes implemented in 
microcode, but other times left as an exercise for the programmer). General purpose com- 
puters (and microprocessors) almost always fetch both programs and data from a unified 
memory subsystem (the Von Neumann architecture) [Ref. 2:p. 24]. 

The Harvard architecture, on the other hand, provides four separate buses: program 
address, data address, instruction, and data [Ref. 2:p. 200]. This increases the rate at which 


sampled signals can flow through the processor, at the expense of added complexity and 


lost flexibility (a small program which manipulates a large data set may not fit into a sys- 
tem designed for a large program which manipulates a small data set). 

Major manufacturers of DSP devices are Motorola (56800 family), Analog 
Devices (SHARC family), and Texas Instruments (TMS320 family), among others. The 
TMS320C6201 DSP device was selected for use in the development project supporting 
this thesis since it was readily available. We shall focus on algorithm optimizations which 
may be appropriate for this device. 

In the development of a new signal detection system, we considered “how can DSP 
devices be efficiently used for spectrum analysis?” This thesis explores a variety of design 
issues in the development of an FFT-based spectrum analyzer. Initially, we describe varia- 
tions on the theme of FFT algorithm implementation, and show how the run time of a 
“textbook” algorithm can be reduced by a factor of eight while preserving the portability 
of C language. Then we describe design choices for spectrum analysis window implemen- 
tation and optimization of an FFT algorithm for the TMS320C6201 DSP device manufac- 
tured by Texas Instruments Incorporated (TI). (This processor and related products from 
TI are referred to below simply as the “C6x” where such usage does not cause confusion.) 

Though FFT algorithms have been derived for data vectors of arbitrary length, the 


FFT algorithms described in this thesis are restricted to those which process data vectors 
of N elements, where N=2*, with k an integer. 
The notation “1K” refers to 1024 = a and “1K2” refers to 1024? = 279 = 


1048576. Let i denote /-1. 


Il. THE FAST FOURIER TRANSFORM DESIGN SPACE 


This chapter explores FFT algorithm design issues which apply to any implementation, 
whether general-purpose computer, digital signal processor, or custom hardware. Such 
issues include minimizing the number of arithmetic operations, trading fast addition for 
slow multiplication, minimizing the memory space needed, opportunities to trade memory 
space for execution speed, and optimizing cache memory efficiency. 
A. ARITHMETIC 

The DFT and Inverse DFT are defined by a pair of mathematical formulae [Ref. 
3:p. 406, 407] which can be translated into arithmetic operations (multiplication and addi- 


tion) in a straightforward way. 


N-1 
Fy = 3 fee” (2.1a) 
j=0 
1 N-1 
—i2Njk/N 
ae 2b 
k=0 


For the purposes of this paper, f; can be regarded as a complex-valued sampled 
time series of length N, F;, as a complex output sample from the k-th bandpass filter, 


i2mjk/N er ee : ; ; 
go as a rotation in the complex plane which is proportional to time (j) and frequency 


(k). Equation 2.1a is the “Forward” DFT; equation 2.1b, the Inverse. Arfken notes that the 


equations can be made symmetrical by distributing the “1/N” factor shown in 2.1b across 


both equations as 1/(/N) [Ref. 4:p. 789]. 

Unfortunately, the straightforward algorithm suggested by equation 2.1a requires 
N complex multiplications and additions for each of the N output values, so the number of 
arithmetic operations is proportional to N°? for the complete transform. All “fast” algo- 
rithms are roughly proportional to N*log>(N), eliminating approximately 99% of the work 
for a 1K-point transform, and 99.998% of the work for a 1K?-point transform. However, 


the constant of proportionality can vary significantly with implementation. 


1. Minimizing Complex Multiplications 
The heart of the FFT is equation 2.2, which illustrates the radix-two algorithm 
[Ref. 3:p. 408]. We compute two transforms of size N/2 (using even indexed samples for 


i2nk/N 
Ios EF eae © Fodd,k (2.2) 


one, odd indexed samples for the other), combined into one transform of size NV. Computa- 
tion of F.,¢, and Fjgq can be accomplished by computing four transforms of size N/4, and 
so on, until the transform size is reduced to one (which is no transform at all). Each stage, 
such as the one above, requires N/2 complex multiplications and additions (one for each 


odd value of k), and there will be log,(V) such stages. Thus, the transform of size N is 
computed with approximately N*log>(N)/2 complex multiplications. This is illustrated 
below for N = 8. [Ref. 1:p. 300]. Each stage has four complex multiplications (the multi- 


pliers being denoted as Wy), and there are three stages. 
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Eight-point FFT Flow Graph. 





2. Minimizing Real Multiplications 
Whether or not the programming language used for an FFT algorithm supports 
complex numbers, complex multiplication must eventually be implemented with real 
arithmetic. The term e797” (w,/*) becomes (Cos(2 1 j k/N) +i Sin(2 tj k/N )); xp 
becomes (Re{x,,} + 7 Im{x, }) (the sums and multiplications by 7 being complex number 
notation, of the form ‘“‘a + i b’, rather than actual arithmetic). 
a. Alternative Expressions for Complex Multiplication 


The conventional way to calculate a complex product with real arithmetic 


is shown in Equation 2.3. 
(a+ib)-(c+id) = (a-c—b-d)+i(b-c+a-d) (2.3) 


However, when multiplication is more time consuming than addition, an 
alternative form may be advantageous.[Ref. 5:p. 430]. 


(a+ib)-(c+id) = (((a+b)-c)-b:(d+c))+i(((at+b)-c)+a-(d-c)) (A) 


The underlined common subexpression in Equation 2.4 reduces the number 
of multiplications from four to three, at the cost of increasing the number of add/subtract 
operations from two to five. This may a worthwhile change if multiplication takes more 
than twice as long as addition. From a parallel pipelined processing perspective, the con- 
ventional expression can complete in two stages (if four multiplication units are available), 
while the alternate expression requires at least three stages, but only three multiplication 
units. Thus, deciding which code will run more quickly requires a detailed knowledge of 
the processor resources. 


b. Radix-four Algorithms 


When N = 4, a radix-four algorithm will be slightly more efficient than the 
radix-two algorithm sketched above. Instead of dividing the input vector into two sub- 
sequences, it is divided into four sub-sequences, so only log4(N) stages of processing are 
needed. This can reduce the amount of data traffic between the processor registers (or 
cache) and data vector memory by half, assuming that the processor has enough registers 


to keep intermediate results close at hand. Multiplication by i can be implemented by sim- 


ply interchanging real and imaginary components, then negating the real component, and 
Wy? is always equal to one so the number of arithmetic multiplications is reduced. A 


flow-graph for a radix-four calculation is shown below [Ref 1:p. 317]. 
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Figure 2. Four-point, Radix-four FFT. 


3. Complex Exponential Constants 

The “N complex roots of unity” constant factors e PoR TEL typically denoted as 
Wy! K and refered to as “twiddle factors”, offer a variety of options to the algorithm 
designer. To call a math library function each time a constant is needed is simple and accu- 
rate, but slow. In e i2NjK/N , 2, T, i, and N are constants, and j and k are integers less than 


N, so the robust generality of a math library call is rarely needed. 


One option is to note that all of these factors are Wj raised to an integer power, and 


can be used in ascending order. Once Wy is known (e.g., provided by a math library func- 


tion), successive constants can be obtained iteratively, as in the following pseudo-code 


fragment [Ref. 6:p. 24]: 


complex w_increment = exp(i 2m / N); 
complex w_jk = (1, 0); /* w to the zeroth power is 1. */ 
for k = 0 to M 

(use w_jk for an FFT calculation) 

w_jk = w_jk * w_increment; /* recursively update w */ 





Figure 3. Recursive Update of Complex Exponential Constants. 


Now only one library function call (or tabulated value) is needed for all M values. 
However, the recursive formula allows numerical rounding error to accumulate propor- 
tionally to the transform size (especially if fixed-point arithmetic is used). 

If many transforms of some particular size N are to be computed (as they will be in 
a spectrum analyzer), we can pre-compute the whole set of constants Cos(2 mj k/N) and 
Sin(2 tj k/N) just once. This occurs when the algorithm is initialized, although the pro- 
gram then requires additional memory for the table. If the algorithm is being designed for 
a predetermined value of N, the constant tabulation can be done when the program is com- 
piled. 

How many values need to be tabulated? One way to answer this question would be 
to look at the entire algorithm to determine the maximum product of j and k, but this 
would give a pessimistic result. Since Sin(x) and Cos(x) are periodic functions, if we can 
map all products of j and k into the interval [0..N-1], we only need to tabulate Cos(2 1 k/ 
N) for k=0 to N-1. Since Cos(21-x)=Cos(x), Sin(x+1/2) = Cos(x), and so on, we may be 
able to minimize the use of memory by carefully indexing into a smaller table. On the 
other hand, complicated indexing logic may impose an intolerable burden on the arith- 
metic processor, especially if it involves time-consuming branches in the control flow. 

With what precision do the values need to be tabulated (or calculated)? Floating- 
point arithmetic typically provides 24 or 48 bits of precision. Integer arithmetic, as used 
by many DSP devices, could plausibly use 8, 16, and 32 bits. Using the rough rule-of- 


thumb that each bit of precision provides six decibels of dynamic range, 8-bit values 


would add quantization noise at the -48 dB (relative to full scale) level, 16-bit values pro- 
vide -96 dB, and 32-bit values provide -192 dB. For our application, -48 dB would be 
excessively noisy, but -96 dB is sufficient. Thus, either 16-bit integer (““short” in ANSI- 
C) or single-precision floating-point were acceptable. 

As we will see in Chapter III, using 16-bit integer Cos() and SinQ) constants allows 
them to be interleaved in memory such that both can be loaded into a 32-bit register with a 
single instruction, and makes efficient use of the DSP device’s 16x16 multiplication unit. 
Tabulating one entire cycle (N values) for each function allows simple (fast) data indexing, 
although it requires eight times as much memory as the more complex indexing scheme 
described above. 

4. Fixed-point vs.Floating-point Data Representation 

The hardware for floating-point arithmetic is inherently more complicated than 
that for integer arithmetic. To perform floating-point addition, the exponent terms of each 
value must be made equal, the mantissa shifted as appropriate, the mantissas added, then 
the result normalized. In multiplication, the mantissas are multiplied and the exponents 
added, after which limited (one or two bits) re-normalization is needed [Ref. 2:p. 296]. 

Analysis of data representation parallels that of coefficient precision. Though sen- 
sor data may be of only eight or twelve bits, 16-bit integers are efficiently processed and 
provide adequate accuracy in our application. 

We can contrast the arithmetic performance of two processors from TI’s C6x fam- 
ily. The TMS320C6201 does only integer arithmetic, while the TMS320C6701 also does 
floating point. These devices have otherwise identical architectures and process technolo- 
gies. The 16x16 bit integer multiply completes in two clock cycles, while the 32x32 bit 
floating-point multiply completes in four clock cycles. In each case, a new instruction can 
start the multiplier pipeline on every clock cycle, but the added latency of the floating- 
point operation increases the likelihood that algorithm dependencies will prevent continu- 
ous utilization of the functional unit. Furthermore, the clock period (at introduction in 
1997) of the ‘C6201 was five nanoseconds, while that of the ‘C6701 was six [Ref. 7: Mod. 
1: p. 37]. (We'll discuss the “C6x architecture in more detail in Chapter III.) 


5. Integer Result Scaling 

The DFT algorithm in the formula above consists of a series of complex multipli- 
cations and additions, which implies that the maximum magnitude of any output value 
may be as large as N times the maximum allowable input value. (A typical worst-case 
input vector has all values equal to max + i max. Fo is then just N*(max + i max).) Ifa 
fixed-point data representation is used, the output word may require Jog>(N) more bits than 


the input word to contain the increased magnitude. To be specific, the 4K-sample trans- 


form requires 12 more bits, the 1K? transform 20 more bits, and the 4K? point transform 
22 more bits. Unfortunately, our ‘C6x processor can most efficiently multiply 16-bit oper- 
ands, so overflow avoidance would seem to require scaling its input data to just four bits 
for the smallest transform of interest and would make the larger sizes infeasible. 


This problem can be addressed by modifying the DFT equation as follows: 


N-1 
1 eka 
F, = ae y el2Kjik/N . fj (2.5) 
j=0 
or 
es, 
Fy, ad - y el2Njk/N . fj (2.6) 
j=0 


We can compensate for these scaling constants when interpreting the spectrum 
analyzer output, and they allow us to distribute the scaling of intermediate results across 


the stages of the FFT. (If forward transforms were followed by inverse transforms, a com- 


pensating change would be needed to the IDFT definition.) Since NV =2* and we have k 
stages of processing, we can implement the second equation above by dividing the result 
of each stage of a radix-two FFT’s addition by two, which is simply a one-bit right shift. 
Alternatively, the summands to the addition can be scaled. When the summands are 
formed by an integer multiplication, the scaling that renormalizes the multiplication result 
can be modified to incorporate summand scaling with no performance penalty. 

With a radix-four algorithm, the designer has more flexibility. Four terms are 


added together at each stage, which produces a potential word-length growth of two bits 


per stage. Right-shifting by two at each stage prevents overflow (as in Eq. 2.6), while 
right-shifting by one (as in Eq. 2.5) allows word-length growth of log4(N). If the input 
data is acquired with eight-bit precision, the number of significant bits can be allowed to 
grow to fifteen (not including the sign). Without intermediate term scaling, the maximum 
FFT size without overflow would be 128. With single-bit intermediate scaling, the maxi- 
mum size is 32K; with two-bit intermediate scaling, overflow is impossible. 

Bear in mind, however, that the dynamic range of the output is still limited to at 
most 16 bits. Consider the fate of an input data vector containing a single non-zero sam- 
ple, at index 0. Without scaling, the non-zero value is duplicated to every element in the 
output vector. With single-bit scaling, its value is reduced by half at each stage, so a value 
of 128 vanishes after seven radix-four stages (16K-point FFT). With two-bit scaling, it dis- 
appears in the fourth radix-four stage (256-point FFT). 

Whether or not distributed scaling is appropriate will depend on the application. In 
some cases, it might be feasible to examine the result of applying conservative two-bit 
scaling; if no spectral lines are found which have sufficient magnitude to cause overflow, 
apply a transform with one-bit scaling, or no scaling at all, to the same data set to achieve 
greater accuracy. If the processor has an overflow flag which is updated with the status of 
every addition (the “C60 does not), responding to an overflow condition would require 
either time-consuming instructions to test the flag or a hardware interrupt to modify the 
flow of the algorithm. 

B. MEMORY 

As arithmetic logic technology has increased in speed and complexity, the time 
required to move data between memory and arithmetic units has become increasingly sig- 
nificant. Regardless of advancing technology, relatively fast memory is relatively expen- 
sive, and many algorithms (with the notable exception of large FFTs) have been found to 
require access to a small fraction of the total memory for much of the total time. Thus, 
most computer memory is organized in multiple levels. The processor’s register file mem- 
ory provides the most rapid transfers to and from arithmetic logic. Frequently used data 


which cannot reside in the register file resides in a small, fast cache memory which han- 
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dles most of the accesses. It, in turn, is supported by a larger, slower “main memory,” 
which may be supported by virtual memory on a local disk, a remote “file server,’ and/or 
archival magnetic tape [Ref. 2:p. 372]. In this section, we assess the impact of the memory 
hierarchy on FFT performance. 

1. In-place vs. Out-of-place Algorithms 

The most commonly presented FFT algorithms minimize the use of memory by 
operating “in place”; that is, out of N complex elements in the input vector, arithmetic is 
performed using two (or four, depending on the radix) of them, after which the modified 
values are written back into the same memory locations. However, this has the disadvan- 
tage of leaving the output vector “scrambled,” so most applications require an additional 
pass through memory data, after the transform proper is complete, to unscramble the 
result. Rearrangement of the FFT flow graph can produce an algorithm that returns the 
output elements in the proper order, although the algorithm requires two memory buffers 
of length N instead of one. Memory address calculations during the transform may require 
slightly more time, but the total execution time is reduced, since the unscrambling opera- 
tion is unnecessary [Ref. 6:p. 49]. 

2. Window Function Storage 

Spectrum estimation with the Welch method of averaged periodograms requires 
smoothing in the frequency domain, implemented by multiplying the input sequence by a 
“window function.” In the time domain, a window function tapers the magnitude of data 
elements near the ends of the array. One popular window function, the “periodic Hamming 
window,” was invented by R.W. Hamming: h(k) = 0.54 - 0.46 Cos(2 1 k/N), and there are 
a number of variations on this theme. For example, Oppenheim and Shafer [Ref. 1: p.242] 
give the following formula: h(k) = 0.54 - 0.46 Cos(2 1k /(N-1)). This is the “symmetric” 
Hamming window, used for digital filter design, which gives subtly different spectrum 
estimates. (As stated by the authors of Numerical Recipes in C in a slightly different con- 
text, “... if the difference between N and N-1 ever matters to you, then you are probably up 
to no good anyway...” [Ref. 3:p. 473].) Selection of a window function involves balancing 


the frequency-resolution of spectrum estimates, accuracy of power measurements, sup- 
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pression of sidelobe power which can mask weak signals, and computational complexity 
[Ref. 8:p.161]. 

For fastest processing performance, the window function should be tabulated, just 
as the Sine and Cosine factors used within the FFT are. The simplest scheme for tabulating 
the window function is to precompute all N values. The amount of memory needed, 
though, can be cut in half by taking advantage of the symmetry of the function. Depending 
on the balance between memory and arithmetic speed in the system, it may be possible to 
eliminate window-function storage memory by re-using the Cos(2 1 k /N) function tabu- 
lated for use within the FFT itself. If 16-bit integer arithmetic is used, the cosine table will 
be scaled to fill the word: cos_table[k] = 32767 * Cos(2 1 k/N). The integer-scaled Hann 
window can be easily derived from this table: h[k] = 16384 - (cos_table[k] >> 1), where 
“>>” is the ANSI-C “right-shift” operator. The Hann window is also attractive because it 
provides lower spectral leakage far from a strong spectral line than the Hamming 
window does, at the expense of a slightly higher close-in sidelobe, as shown in Fig. 4. The 
signal to be analyzed consists of three sinusoids, of randomly selected frequency and 
unequal power. The weak signals are nearly hidden by the leakage of the Rectangular and 


Hamming windows. 


12 








— Rectangular 
~ Hann 
oO eet Hamming 








power 
| 
(oe) 
oO 
T 
i 











—1 60 ; | | | | | | 
0 20 40 60 80 100 120 140 
frequency 


Figure 4. Spectrum Leakage for Three Window Functions. 


If one of the more complex window functions was selected (Kaiser-Bessel or 
Dolph-Chebyshev, for example [Ref. 8:p.194]), we could conserve memory by tabulating 
a decimated window function, then interpolating it as needed. Since window functions are 
relatively smooth, we need only tabulate the even-indexed elements, and “hold” each tabu- 
lated value for the following odd data element (the simplest possible interpolation). If we 
do this, we’ll find that spurious signals appear within the output spectrum estimate as 
shown in Fig. 5, for the following reason. Consider the interpolated window function as 
the sum of the true window function and an error function. The error function will be zero 
for even elements, but will be non-zero and proportional to the first derivative of the win- 
dow function for odd elements. Thus, the spectrum of the error function will contain a dis- 
crete component at one half of the sample rate. Multiplication in the time domain 
corresponding to convolution in the frequency domain, we find that the spectrum estimates 


produced with decimated window functions are convolved with the spectra of both the true 
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window function and the error function. Spurious spectral lines are the result, as shown at 


the left side of Fig. 5. (The three-sinusoid input signal is the same as for the prior figure.) 
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Figure 5. Decimated Window Functions Produce Spurious Spectra. 


Quantitatively, the relative power of the spurious signal depends on the size of the 
FFT. As the size increases, the magnitude of the first derivative, and thus the amplitude of 
the error function, decreases. The spurious signal is 38 dB below the true signal for 
N=128, and it falls by 6 dB each time N is doubled. 

If we use linear interpolation, instead of holding the prior value, the error function 
will be proportional to the second derivative of the window function. The spurious signal 
is then 71 dB below the true signal for N=128, and falls by 12 dB each time N is doubled. 
Trading off window storage space, interpolation complexity, and spurious signal levels can 


be done in the context of a specific application. 


14 


3. Unscrambling the Output 

An in-place, decimation-in-frequency algorithm produces a scrambled output vec- 
tor. That is, the transformed elements are arranged in memory as if sorted by a key which 
is the element index, reversed such that the most significant bit is in the least significant 
position. For a 256-point radix-two algorithm, element 0 will be in memory location 0, but 
element | (binary 0000 0001) will be in location 128 (1000 0000), element 2 (0000 0010) 
will be in location 64 (0100 0000), element 3 (0000 0011) in location 192 (1100 0000), 
and so on. Since we want to observe features of the signal spectrum which span more than 
a single FFT output value, we need an efficient way to re-order the data by frequency 
index. 

Consider the data re-ordering algorithm for a radix-two algorithm. The entire vec- 
tor can be re-ordered by incrementing the index through the vector, comparing the index 
with its bit-reversed value, and swapping data elements when the reversed index is larger 
than the original. (When the reversed value is smaller, the swap has already been done for 
this pair.) 

Gutman [Ref. 9] describes an elegant algorithm for reversing bits, which can be 


coded in C for a 16-bit word (“k’’) as follows: 


a = ((k & OxOOFF) << 8) | ((k >> 8) & OxOOFF); 
b = ((a & OxXOFOF) << 4) | ((a >> 4) & OXOFOF); 
c = ((b & 0x3333) << 2) | ((b >> 2) & 0x3333); 
d = ((c & 0x5555) << 1) | ((c >> 1) & 0x5555); 


Figure 6. Reversing the Order of Bits in a Word. 


In C, “&” is the bit-wise “and” operator, and “|” is bit-wise “or”. The first line 
swaps the high byte with the low byte. The second swaps the four least significant bits 
with the four most significant bits in each byte, and so on, until the last swaps individual 
bits. Note that this algorithm is free of branch instructions and recursive assignments, and 
so a pipeline is effective in speeding up its operation. 

For a radix-four algorithm, the bits of the index are swapped as two-bit digits. If 


we denote the eight-bit index as “b7b¢b5b4b3b 2b ;bp,” the scrambled index is 
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“Db  bob3b2b5b4b7b,” (not “bob; byb3b4bs5b¢b7,” as for radix-two). To reverse digits as 
required for a radix-four algorithm, we can simply omit the last line of Fig. 6. 

The algorithm is straightforward for index values which fill a power-of-two word 
size (1.e., 8 bits: N=256, 16 bits: N=64K), and can be extended for arbitrary lengths. For a 
N=2K FFT, the index will be 11 bits wide, and the following sequence can be used to 
reverse the bits: 


a 
b 
c 


((k & OxO01F) << 6) | (k & 0x0020) | ((k >> 6) & OxOO1F); 
((a & O0x00C3) << 3) | (a & 0x0124) | ((a >> 3) & Ox00C3); 
((b & 0x0249) << 1) | (a & 0x0124) | ((b >> 1) & 0x0249); 


Figure 7. Reversing an 11-bit word. 


Courtney, at Texas Instruments, has published a different algorithm for sorting 
FFT result vectors [Ref. 10], which relies on a lookup table to determine the bit-reversed 
index. When optimized in assembly language, it runs in approximately (N / 4) * 7 clock 


cycles on their VLIW processor. 


4. Locality of Reference (Improving Cache Effectiveness) 

The basic FFTs access memory according to a pattern which may prevent effective 
use of conventional computer cache memories. The signal flow graph for a 16-point, in- 
place, radix-four FFT is shown below in Fig. 8. There are two stages in this decimation-in- 
frequency algorithm, each consisting of four four-point FFTs. Each four-point FFT is 
referred to as a “butterfly” calculation (though perhaps they look more like spiders in this 
figure). The input vector is arranged sequentially in memory (m;= x;); the output vector is 
scrambled. 

Input values to the first stage are read from scattered locations. In this 16-point 
transform, the first four samples to be processed come from locations 0, 4, 8, and 12; ina 
4K-point transform, the first four samples would be 0, 1024, 2048, and 3072; for a 1K?- 
point transform, 0, 262144, 524288, and 786432. The cache management [Ref. 2:p. 344] 
assumption of “spatial locality,” that most memory accesses tend to occur near recently 


used locations, is violated for all but the last few stages of a large transform. The second 
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cache management assumption, “temporal locality” (that memory which has recently been 
accessed is likely to be accessed again soon), is also violated. Once x, for example, has 
been read once and written once in stage 1, it won’t be read again until all other elements 
of the array have been read (once) and written (once) in stage 1. The impact of violating 
these assumptions will be demonstrated experimentally below. If the entire FFT data vec- 
tor (and any tabulated constant factors) fits within the primary cache memory, the scat- 
tered access pattern does not affect the access time, but exceeding this size can have a 


significant impact on performance. 
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Figure 8. Memory Access Pattern, Highlighting the First Butterfly. 


The solution to this problem is to decompose a large FFT into a series of smaller 
transforms, each of which can fit within the cache, and each of which is self-contained 
with respect to the rest of the data vector. To illustrate this idea, consider the 16-point data 
vector shown in Fig. 9, arranging the 16 input points into a four by four grid. 


The first stage of the 16-point FFT in Fig. 8 applies a four-point FFT to each col- 
umn of the left grid shown in Fig. 9. Then each element is multiplied by Wy k (where j and 


k represent the row and column indices), before a four-point FFT is applied to produce 
each row of the right grid in Fig. 9. Unscrambling of the data can be accomplished by 


reading down the columns of the right grid. This idea can be extended to much larger 
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SIZES; a 1K?-point FFT can be decomposed into twenty stages of radix-two FFTs, ten 


stages of radix-four FFTs, or two stages of “radix-1K” FFTs. 
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Figure 9. Reshaping the Data Vector. 
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Little would be gained if the small transforms operated on the widely-scattered 


data elements of a single column in main memory, but it is simple to “gather” these ele- 


ments into a small work buffer, perform the transform, and then “scatter” the transformed 


elements back into their original positions (some vector processors have vector-gather and 


vector-scatter instructions). When all data for a single radix-1K FFT can fit within the pri- 


mary cache of the processor, the radix-1K FFT can then be decomposed into five radix- 


four stages which run at the full processor speed. 


Even greater efficiency can be achieved, however, if we transpose the entire data 


set before performing the small transforms [Ref. 6:p.139], as shown in Fig. 10. 
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Figure 10. 16-point FFT Built from Four-point FFTs and Matrix Transposition. 
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5. Distributed Parallel Implementation 

The algorithm described above suggests an algorithm for distributing computation 
on a multiprocessor architecture, since each of the row transforms can in principle be done 
by a separate processor. Consider a system architecture which must accept N input sam- 
ples at a rate which may exceed the bandwidth of a single memory bus. High-speed logic 
can distribute samples to B memory buses, which implicitly performs the first matrix 
transpose operation. Once all N samples for a given observation interval are stored in 


memory, each of the B processors simultaneously performs an FFT of size N/B. Then, an 


interprocessor matrix transpose is followed by the twiddle-factor multiply and N/ B’) 
transforms of size B on each of the B processors [Ref. 6:p.173]. The result can then be 
read from the B memory systems in round-robin fashion to implicitly perform the final 
transpose. 

As an approach to increase the bandwidth of a spectrum analyzer, the fast-sam- 
pling distributed FFT can be contrasted with a frequency-domain “divide and conquer” 
approach. Rather than distributing samples from a single analog-to-digital converter 
(ADC) across multiple processors, we can use a channelizing architecture to distribute 
subranges of the input bandwidth to independent spectrum analyzers. Channelizing can be 
performed with traditional analog electronics (oscillators, mixers, and filters) followed by 
a relatively slow ADC, or with one or more relatively fast ADCs followed by digital oscil- 
lators, mixers, and filters in hardware or software (as on the Pentek 6216 Dual Digital 
Receiver Module [Ref. 11]). 

The channelizing architecture allows each processor to operate independently, 
without the synchronization and communication complexities of the distributed FFT 
implementation. However, the fast-sampling distributed FFT architecture may be favored 
if we also wish to analyze in the time domain those detected signals which may occupy a 


large fraction of the searched bandwidth. 
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6. Cache-efficient Transpose Operation 
The algorithm described above transposes the full data set three times. The sim- 
plest way to perform such a transpose is with the following code, in which the two-dimen- 


sional character of the array is implicit in the read and write index calculations: 


for (read_row = 0; read_row < read_rows; read_row+t+) { 
for (read_col = 0; read_col < read_cols; read_col+t+) { 
write_col = read_row; 
write_row = read_col; 


y[write_row * read_rows + write_col] = 
x[read_row * read_cols + read_col]; 


Figure 11. Simple Transpose Source Code. 


This algorithm allows the cache controller to load multiple elements from sequen- 
tial memory locations, so the read miss rate is low. However, write operations are scattered 
throughout the result array, so a new cache line must be read, updated, and written back 
for almost every result. This inefficiency can be avoided by transposing blocks of elements 
[Ref. 6:p. 129]. When a block transpose of 16 by 16 complex floating-point elements can 
be done completely in the cache, each cache write operation will store 16 transposed ele- 
ments (128 bytes), instead of one. Portable source code for this algorithm can be found in 


the Appendix. 


C. EXPERIMENTS WITH PORTABLE PROGRAMS ON RISC 

In this section, we examine some portable implementation codes for the FFT 
which employ some of the techniques described above. 

1. Test Method and Conditions 

To measure the execution time of a code, we used the UNIX “time” function, 
which provided three figures: the total elapsed time for the program to run, the amount of 
time that the processor was running the user program, and the amount of time that the pro- 
cessor was handling operating system functions required by the program. The “user” CPU 


time was used for measurements below. 
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The computer used for testing was a Silicon Graphics Indigo, manufactured in the 
mid 1990s, with a 100 MHz MIPS R4000 Reduced Instruction Set Computer (RISC) pro- 
cessor with floating-point coprocessor. Main memory capacity is 64 Mbytes, with a single 
1MB unified secondary cache, and dual 8KB primary caches (one for data, one for instruc- 
tions). The Silicon Graphics C compiler was used to compile the test programs, using -O3 
(the maximum) compiler optimization flag. 

2. Test Results 

The following table illustrates the variation in performance between various imple- 
mentations. In most cases, a large number of transforms was done to get a measurable run- 
time. When the run-time is given as a sum, the first part of the sum is the time needed for 


initialization, while the second part is the execution time for “one more” FFT. 


Table 1. Execution Times of Portable FFT Algorithms. 
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Algorithm “fourl” is taken from Numerical Recipes in C [Ref. 3:p.411], slightly 
modified to use only single-precision floating-point arithmetic. It is a “radix-two, decima- 
tion-in-time, Cooley-Tukey” algorithm, and the time includes the bit-reversed index sort 
process. 

Algorithm “St” is a radix-four Stockham out-of-place transform [Ref. 6:p. 105], 
with in-line calls to the math library for multiplier coefficients. In each stage, data is trans- 
formed from the input/output buffer to a work buffer, then (when the stage is complete) the 
contents of the work buffer are copied back to the input/output buffer in preparation for the 
next stage. No data reordering is needed with this algorithm. It is slightly faster, in most 
cases, than the “four!” algorithm. 

Algorithm “St-T” improves on “St” by tabulating trig functions. (For small trans- 
forms, the time required to initialize the tables is too small to measure meaningfully.) 

Algorithm “St-F” improves on “St-T” by performing butterfly calculations on odd 
stages from the data input buffer to a work buffer, and on even stages from the work buffer 
back to the input buffer. For values of N which are an odd power of four, the result is 


returned in the work buffer. 


Algorithm “Fact-T” factors N into two sets of smaller FFTs. For example, the [kK 
point transform is calculated by conceptually reshaping the data vector into an array of 
4096 columns by 256 rows (both even powers of four, for the convenience of the St-F 
algorithm used for in-cache transforms). The array is transposed to 256 columns of 4096 
rows, so elements which were separated by 4096 are now adjacent in memory. The 256 
“St-F” FFTs of size 4K are followed by an element-by-element multiply by the complex 
constants, then another transpose. After 4096 FFTs of size 256 and another array trans- 


pose, the complete transformed data is properly ordered in memory. Since the array trans- 


pose operation is out-of-place, this algorithm requires a transpose work buffer of 1K? 
elements. (An out-of-place transpose is faster than an in-place transpose, as well as being 
simpler for non-square arrays.) The tabulated constants also occupy 1K? elements, and the 
St-F algorithm uses a work buffer of 4096 elements. Thus, the total memory demand for 


this algorithm is roughly three times greater than for any of the others. 
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Algorithm “Fact-B,” not listed in the table, reduces the memory demand of Fact-T 


by eliminating the matrix transpose (and the transpose buffer). When, for example, a 256- 


point FFT is to be done with elements scattered throughout the 1K?-point array, they are 
first gathered into a 256-element buffer. For the million-point transform, this algorithm 
takes 10 seconds. Fact-T is faster because it transposes multiple columns of the array at 
the same time, while Fact-B only “transposes” the one that’s currently needed. 

Algorithm “FFTW” was obtained via the Internet from “The Fastest Fourier Trans- 
form in the West” project sponsored by the Massachusetts Institute of Technology [Ref. 
12]. This program measures the performance of various FFT components to automatically 
synthesize an FFT algorithm which is “nearly optimal” for the current processor, whatever 
the FFT size and processor happen to be. Note that the time needed to analyze (either 
through estimation or actual measurement) and synthesize is shown as the first part of the 
sum in the FFTW column, and its units are always seconds. As we might expect, after 


we’ve paid the start-up penalty (which is substantial), FFTW demonstrates excellent per- 


formance for all but the largest transforms. For the 256K and 1K? FFTs, though, it is much 
slower than Fact-T. On the other hand, it does not demand as much memory as Fact-T. 
Whether this was a conscious tradeoff or not is unknown, but memory is cheap and time is 
priceless. 

To help illustrate the impact of cache inefficiency, we compare actual processing 
times with a simple model. Column “‘St-F model” assumes that memory access time is 
irrelevant, and so execution time is estimated using C*N* log4(N), where C is a constant 
of proportionality (C=4.297e-7) based on the “St-F’ time for N=1024. The model seems to 


be reasonably accurate as N increases from 64 to 16K, but the run time exceeds the time 


predicted by the model as the cache miss rate increases. By the time N reaches 1K?, St-Fis 

taking 5.3 times as long as predicted. Fact-T, however, only takes 1.7 times as long, indi- 

cating that the transform factoring process is effective at masking cache limitations. 
Column “FFTW-model” is analogous to St-F model, using the measured optimiza- 


tion numbers. Again, we see that memory bandwidth limitations make FFTW take 
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roughly four times as long as predicted by the model. Put another way, 75% of the CPU 


performance is wasted just waiting for data. 


Figure 12 illustrates the performance degradation imposed by memory bandwidth 


limitations. The plot labeled “St-F / St-F model” is the ratio between actual St-F time and 


the St-F model. The “FFTW / FFTW model” plot is analogous, while the “Fact-T /St-F 


model” plot is the ratio between the Fact-T time and the St-F model (since the Fact-T 


algorithm uses the St-F FFT internally). For small transforms, the transpose operations 


uselessly rearrange elements already in the cache, so Fact-T takes longer than St-F. 
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Figure 12. Cache Impact on FFT Performance. 


The run time for the Fact-T algorithm can be modeled as Tyot4) = Tinit + 3 * Tirans- 


pose 


+ N, = TRRTN2 + N> 7 TERTNI: For a 1K? FFT, we can use N,=N>=1024, or N,=256 


and N»,=4096. We can get approximate values for Tppy from the upper half of Table 1. 


Using values from Table 1, we expect a factoring algorithm which used FFTW, instead of 
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ST-F, for the small FFTs to cut another two seconds (% 26) from the time required for 


Fact-T to transform 1K? points. 
D. PORTABLE PROGRAMS ON PENTIUM-III AND RISC. 

This section compares the performance of a 733 MHz Pentium-III processor and a 
100 MHz MIPS R4000, using the “St-F”’ program described above. 

1. Test Conditions 

The “St-F” program was recompiled and executed on a 733 MHz Pentium-III pro- 
cessor with 1 GByte of main memory. The compilers (Microsoft Visual C++ V6.0 and 
Intel C/C++ V4.5) were configured to optimize for maximum speed of execution. We also 
measured the complex-float, not-in-place FFT routine found in the Intel Signal Processing 
Library, with the Intel C compiler. To enable comparisons with the fixed-point arithmetic 
implemented in the DSP device in the next chapter, the code was also modified to work 
with short (16-bit) integers (with internal scaling to avoid overflow). Each routine was 
called enough times to allow convenient measurement with a stopwatch, so each test ran 
for 5-30 seconds. Since each ran at least ten iterations, table initialization time was insig- 


nificant and is not listed below. 
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2. Test Results 


The run times are shown in the table below. 


Table 2. Pentium-III Run Times (all millisec). 






































N float short float float 
log4N | Microsoft | Microsoft | Intel C | Intel C | Intel library 
64 
3 0.0068 0.0078 0.0059 
256 
4 0.032 0.036 0.027 
1024 
2 0.16 0.17 0.14 
4096 
6 0.75 0.85 0.66 
16K 
7 4.7 3.9 3.3 
64K 
8 41 32 38 
256K 
9 193 153 178 
1K? 

10 840 730 740 630 730 





From Table 2, we see that Intel’s C compiler produces code which is slightly faster 


than Microsoft’s. 


The FFT routine in Intel’s Signal Processing Library is much faster, for small FFT 


sizes, than our portable C code. This is probably due to Intel’s expert optimization of the 


parallel “MMX”’ instruction set architecture. 


For small transforms, FFTs which use short integer arithmetic are slower than 
those which use floating-point arithmetic. Given that floating-point arithmetic is more 
complicated than integer, this may come as a surprise, but Intel has invested in good float- 
ing-point performance in the Pentium-III. In the floating-point version, data calculations 
can be executed (in floating-point hardware) while address calculations take place in inte- 


ger hardware; in the integer version, the integer hardware must perform both tasks. 
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Figure 13. Library and Integer Performance, Relative to Portable Code. 


For large FFTs, the integer version runs faster than the floating-point version. This 
is probably due to the reduced memory bandwidth required to update the cache, since each 
ANSI-C float value takes four bytes, while each short integer value only takes two. 

While the library routine slows down the most as the FFT size becomes large, it 


remains slightly faster than the portable C code. 
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Figure 14. Cache Impact on Pentium-III Performance. 
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We see in Table 3 that, even after compensating for the difference in clock speeds, 


the Pentium-III system is at least twice as fast as the R4000 RISC system when running 


the floating-point portable C program. The 1K and 4K sizes illustrate the in-cache perfor- 


mance, while the 1K” size shows that the full memory system is more efficient in the Pen- 


tium-III system than that in the R4000. 


Table 3. R4000 vs. Pentium-III Performance. 
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FFT size R4000 Pentium-III ratio clocks? se 
comp. ratio 
1K 2.2m 0.14m 15.7 2.14 
4K 11m 0.66m 16.7 2.28 
1K2 24s 0.74s 32 4.4 





This chapter illustrates several factors that affect the performance of FFT spectrum 
analysis on general-purpose computers: algorithm design, compiler technology, expert 
optimization (in the Intel Signal Processing Library), and CPU architecture. Bear in mind, 
though, that the R4000 predates the Pentium-III by roughly five years, so it must not be 
taken to represent the current state of RISC technology. 

In the development of a signal processing system, the flexibility of portable pro- 
grams on general-purpose computing hardware is of little importance, since the system 
will be dedicated to a specific process. If special-purpose hardware can accelerate this pro- 
cess, without imposing excessive development costs or delays, it should be included in the 
design. Since the FFT is a well known and important component of digital signal process- 
ing, we expect that specialized digital signal processing devices can provide such acceler- 
ation. One such device, the TMS320C6201 (by Texas Instruments), is examined in 
Chapter III. 
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Ill. DIGITAL SIGNAL PROCESSOR APPLICATION 


A. OVERVIEW OF DIGITAL SIGNAL PROCESSING 

Digital signal processing generally involves the algorithmic transformation of a 
sequence of measurements into some structure that is more useful. For example, the pro- 
cessor in a modem transforms a sequence of binary digits from a data terminal into a 
sequence of numbers representing a waveform that can pass through some communica- 
tions medium, and vice versa. The processor in a digital cellular telephone transforms 
numbers representing the speech waveform from a microphone into bursts of numbers 
which satisfy the multiple-access communication protocol of the system, and reconstructs 
conversation from received bursts. 

Though we often think of the input to a signal processing algorithm as a time- 
series from a single sensor, acoustic beamforming (used in SONAR) and radio direction- 
finding involve combining the outputs of an array of sensors. 

In commercial signal processing applications, a single program may be developed 
to execute on thousands or even millions of processors, such as those found in cellular 
telephone handsets. Accordingly, economic forces tend to favor minimizing the unit hard- 
ware cost (with adequate performance) over ease of programming, since the cost of devel- 
opment will be shared by all buyers. Once integrated into a product, most DSP 
applications will execute without change for the life of the product. Both of these forces 
push toward a “translate slowly; run quickly” characteristic which is more tolerant of 
architectural innovation than the general-purpose computing market. The developer of a 
new DSP device need not be concerned with maintaining compatibility with past genera- 


tions of application or operating system programs. 
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B. DESIGN FEATURES OF THE TMS320C6x FAMILY 

Recent advances in computer architecture (apart from multiprocessor parallelism) 
have moved toward performing micro-operations in parallel. A simple pipeline can fetch 
instruction(k) while decoding instruction(k-1), loading operands for instruction(k-2), per- 
forming arithmetic for instruction(k-3), and storing the result from instruction(k-4) to 
memory. However, this assumes that there are no dependencies between these instructions. 
If an operand for instruction(k-3) is stored at the same memory location as the result of 
instruction(k-4), and instruction(k-3) loads the value before instruction(k-4) has stored it, 
then the result of instruction(k-3) may be incorrect. “Superscalar’ processors, such as the 
Intel Pentium family, contain logic which detects dependencies as the program is exe- 
cuted, re-ordering instructions or stalling the pipeline (for example) until dependencies are 
satisfied. Thus, the widely used Intel x86 instruction-set legacy can be executed with 
increasing speed, at the expense of complex hardware and variable timing. 

Programmers of Very Long Instruction Word (VLIW) and RISC architectures 
address the dependency problem during development (design and compilation), rather 
than execution, of the program. This allows the logic which is dedicated to dependency 
analysis to be eliminated. Additional registers, cache memory, and/or arithmetic units can 
be put in its place, or the overall size of the device can be reduced (lowering its cost). 
Instead of automatically detecting the opportunities for parallel processing implicit in a 
sequence of simple, short instructions (typically 32 bits long), VLIW machines employ an 
instruction word which has enough bits to explicitly schedule parallel operations. In the 
C6x series, the instruction word (“execute packet’) can be as long as 256 bits, specifying 
up to eight simultaneous independent arithmetic and/or memory access operations. Addi- 
tional parallelism can be achieved by the use of reduced-precision arithmetic instructions, 
which allow, for example, two 16-bit integer additions to be performed with a 32-bit adder 
by blocking the carry propagation from the low half to the high half. Multiply instructions 
which take as operands either the high or low halves of 32-bit registers also facilitate the 


packing of 16-bit values into 32-bit registers. On the other hand, effective utilization of 
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parallel hardware assets depends on the inherent parallelism of the algorithm, its transla- 
tion into instructions, and the memory access required. 

Major components of the C6x processor are sketched in Fig. 15. Note that a set of 
four functional units is associated with each register file, with two “cross-paths” to allow 
one functional unit from each set access to the opposite register file. Devices in the C6x 
family also include various configurations of on-chip memory, serial ports, timers, direct- 


memory access controllers, etc., but they are of no concern here. 
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Functional Units Registers 


.L1: add, sub, and, or [| Register file A 
(A0-A15) 
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Figure 15. Simplified C6x Processor Block Diagram. 


Figure 16 shows the assembly-language source code for a sample execute packet, 
taken from an FFT benchmark program published by Texas Instruments in C6x assembly 


language. 
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, comments... 


sub -L1 al, a2, a4 ; a4 = al - a2, using functional 
nee, Gl" 

shr -S1 a4, 1, a7 ; a7 = a4 >> 1 (shift right, / 2) 

shr .S2x a4, 1, b9 ; b9 = a4 >> 1 (but on the B-side) 

mv .L2 b6, b0 ; bO = b6é 

stw -D1 al2, *+a0[5] ; store word Al2 to memory at the 


; address five words above the 
; location in AO. 








stw “DZ. :bl2,. e+ LS] 


Figure 16. Parallelism Explicitly Coded Into an Execute Packet. 


The “| |” symbol indicates that the instruction which follows should be performed 
in the same clock cycle as the instruction on the preceding line, so all six of the instruc- 
tions above execute in the same clock cycle. 

The VLIW architecture simplifies the instruction processing pipeline by making 
parallelism explicit during execution of linear sequences of instructions. Very few pro- 
grams, however, run for very long without requiring a deviation from sequential instruc- 
tion fetching which invalidates the partially processed instructions in the pipeline. 
Conventional assembly-language programs perform a comparison, and “conditionally 


branch,” depending on the result of the comparison. To minimize the number of branches 
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in a program, the C6x processors make all instructions conditional, not just branches. For 
example, the integer “‘absolute-value” function can be coded as shown in Fig. 17. 


ANSI C 


if (x < 0) x = -x; 
(next statement) 





Gnu C/assembler for Intel Pentium 





























cmpl $0,-4 (%ebp) compare x (in memory) < 0 
jge .L1 if x >= 0, jump (branch) to label Ll 
negl -—4(%sebp) negate x (in memory) 


-L1: (next statement) 


TI C6x assembler 
ldw .d2 *+SP[0x3], B4; initiate load of x into reg. 
nop 4 ; wait for load to complete 
cmpgt .12 0x0, B4, BO ; if 0 > B4, BO = True 
| | sub -s2 0x0, B4, B4 ; (parallel) B4 = 0 - B4 
[BO] stw .d2 B4, *+SP[0x3]; if BO true, store updated x 


(next statement) 


Figure 17. C6x Branch Avoidance with Conditional Store Instruction. 


The Intel code implicitly fetches the value of x from memory to set a condition 
flag based on the comparison, but doesn’t retain the value in a register. Then, it (condition- 
ally) jumps to the label L1, which may disrupt the instruction processing pipeline. If the 
jump is not taken, x is again implicitly loaded from memory, negated, and rewritten. 
(Whether or not the processor actually performs this exact sequence of operations depends 
on how clever the processor is at interpreting the instruction stream.) 

The C6x code, on the other hand, does not interrupt the sequential pipeline flow of 
instructions; register BO is used as a condition flag and simply prevents storage of the 
negated value of x (in B4). The negation is always calculated, in parallel with the compar- 
ison, but the result may be ignored. (The “nop 4” instruction explicitly stalls the processor 
while it waits for the memory to respond. In other algorithms, it may be feasible to execute 
the “Idw” instruction earlier in the instruction stream, so the delay cycles can be occupied 


by useful instructions.) 
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The impact of branches is also reduced in the C6x by allowing the programmer to 
insert instructions into the “delay slots” which follow a branch instruction (not shown 
above). 

Efficient use of fast arithmetic logic demands a ready supply of operands. At the 
top of the memory hierarchy are the processor’s thirty-two 32-bit registers, which provide 
all arithmetic operands to the arithmetic units. Register addresses are encoded in each 
instruction and access is immediate. A register can serve as both source and destination in 
the same instruction cycle, and a register can be read by as many as four simultaneous 
instructions. 

The next level of memory hierarchy is the on-chip Internal Data Memory (IDM), 
storing 64K bytes (16K 32-bit words). To read from the IDM into a register requires five 
clock cycles, from the time the address is sent (from a register) until the data is written to a 
register. Such data reads are pipelined, though, so two data accesses can be started on 
every clock cycle. Writing to IDM has similar timing. Note that memory read/write 
latency can be an important factor to consider when scheduling an algorithm. If the read 
operation can be initiated five cycles before the value is needed, the latency is invisible. 
Otherwise, the program must be coded to execute “no-op” instructions before continuing. 

In addition to data accesses, the processor must also be supplied with program 
instructions. At peak performance, it will be executing 32 bytes (not 32 bits) of instruction 
code with every 5 nanosecond clock cycle, or 6400 Megabytes per second. Practical algo- 
rithms may demand this bandwidth a substantial fraction of the time. The Internal Pro- 
gram Memory (IPM), containing 16384 instructions, provides this bandwidth through a 
256-bit wide path to the processor. 

Access to external (off-chip) memory is also supported. From the program’s point- 
of-view, internal and external memory both have a five clock latency. However, external 
memory actually takes longer to respond, so execution of all instructions is suspended as 


long as necessary for any data. The impact of this is shown in section E. 
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C. SOFTWARE TOOLS AND TECHNIQUES 

Development of an application for the C6x requires translation of the algorithm 
into a sequence of arithmetic operations, allocation of the arithmetic to the parallel func- 
tional units of the processor, allocation of variables to registers, and allocation of data 
structures to the memory hierarchy. A collection of tools is needed to perform these tasks. 

1. TI’s Tool Set: Code Composer Suite 

Implementation of the algorithm in “ANSI C” is typically the first stage of devel- 
opment. Though the programmer has the least amount of control over exploitation of par- 
allelism, a C program can be portable across many different machines and operating 
systems. Development in C can prove, in a workstation environment, that the algorithm is 
theoretically sound and worthy of integration into the DSP environment, with less effort 
than starting development on the DSP. In TI’s “Code Composer Suite” development envi- 
ronment includes several options for optimization of the executable program. 

Compiler “pragma” directives also allow the programmer more control over the 
compiling process than the standard C language provides. Data structures can be assigned 
to specific memory sections, and the programmer can exercise fine-grained control over 
some of the optimization efforts of the compiler. 

The C compiler for the C6x sold by TI can be used to compile ANSI C, but also 
supports processor-specific hardware features using “intrinsic functions.” An example is 
shown in Fig. 18. 


int sum_hl = 0, sum, index; 
short b[100]; 


for (index = 0; index < 100; index += 2) { 
sum_hl = _add2( sum_hl, *(int *)b[index]); 

} 

sum = (sum_hl & Oxffff) + (sum_hl>>16); 


Figure 18. Calculation Using Parallel Partial-Word Arithmetic. 
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This code fragment computes the sum of the short (16-bit) values in the array 
“‘b” by reading pairs from memory as int (32-bit integer) values, accumulating the even 
elements in the high 16 bits of “sum_h1,” accumulating the odd elements in the low 16 
bits of “sum_h1,” and finally combining the high and low partial sums into a total. Only 
50 loop iterations are needed to sum the 100 elements. Similarly, the “_mpyh1 ()” and 
“ mpylh () ” intrinsics allow the high half of one register to be multiplied by the low half 
of another, without disturbing the other half of either register. (This is especially handy for 
the complex arithmetic used in signal processing.) 

Intrinsic functions allow the C programmer access to specialized instructions of 
the C6x, but not to the allocation of variables to processor registers. If this additional level 
of control is needed to improve algorithm performance, the programmer can use “linear 
assembly language.” Each linear assembly statement specifies the operation of and argu- 
ments to one functional unit, and the assembly optimizer tool combines statements into 
parallel “execution packets.” Programming in linear assembly code requires the program- 
mer to manage the assignment of variables to registers (bearing in mind that only five spe- 
cific registers can be used as condition flags, and only eight can be used for certain 
addressing modes), the assignment of functions to functional units (e.g., though multipli- 
cation can be done only in the “.M” units, AND can be done in both .L and .S, and ADD 
can be done in .L, .S, and .D), and access to variables from functional units (A-side func- 
tional units can only write results to A-side registers). Linear assembly language allows 
the programmer to “misuse” processor control registers as scratchpad storage. Control 
registers cannot supply operands to arithmetic units, but can be more quickly accessed 
than even on-chip memory. Of course, the programmer must ensure that this does not pre- 
vent normal operation of the processor; any function which uses the Interrupt Return 
Pointer control register for data must never attempt to return from an interrupt service rou- 
tine without somehow restoring that value! 

For maximum control, the programmer can write scheduled assembly language, 
manually combining statements into parallel execution packets. Developing scheduled 


assembly language can be a challenging task. In addition to the complexities of linear 
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assembler, the developer must also be aware of constraints on operation parallelism, such 
as the fact that, of the eight operands which could be needed by the four A-side functional 
units, only one operand can be read from any of the sixteen B-side registers, and vice 
versa. The developer must also be aware of operation latencies. When loading a word from 
memory, the destination register can be used for other purposes for four more instruction 
cycles before the loaded word actually arrives. The internal latency of the multiply unit 
means that the result can’t be read from the destination register until two cycles after the 
multiply is issued. (This means, though, that a multiply unit can be used to simply store a 
value that would otherwise overflow the register set. Multiplication of the value by one in 
cycle k allows use of cycle k+1 to save the contents of the register which will be overwrit- 
ten by the multiplier result in cycle k+2. This saved substantial time in our FFT function.) 

Regardless of the language used to express the algorithm, the resulting executable 
program and data structures must be allocated to specific memory addresses. Embedded 
processors are typically surrounded by customized memory configurations and rely on 
physical hardware addresses (both for memory of various types, and memory-mapped 
input/output device registers). The program linker performs this function. 

Allocating data to a relatively slow memory can have a dramatic effect on perfor- 
mance. For example, our FFT code (with intermediate stage scaling) computed a 4K-point 
transform in 402 sec when the data vector and constant table were stored in Internal Data 
RAM. The same code, with constants in Synchronous Burst Static RAM, took 671 usec. 
After moving the data vector to SBSRAM, the algorithm took 3571 usec (see details 
below). 

2. Auxiliary Tools 

a. “Financial” Spreadsheet 

Optimization of an algorithm in scheduled assembly language (as defined 
above) can begin with scheduled assembly language code generated by a compiler, based 
on a portable description of the algorithm. Modifying the assembly language code can 
then incorporate the programmer’s understanding of the precise problem to be solved. (For 


example, the programmer may know that a certain variable can only take on one of three 
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different values, which cannot be expressed to the compiler but may simplify the prob- 
lem.) Or the programmer may be presented with a completely scheduled assembly-lan- 
guage program which needs only a slight change to perform the current application. In 
either case, the trick is to make an incremental change (add new behavior) without disrupt- 
ing the existing program. 

One way for the programmer to track the total processor state during con- 
current operations is with a spreadsheet program (e.g., Gnumeric or Excel). Though none 
of the calculation features of the program are used, the flexible tabular format is a valuable 
aid. For the effort described in this Thesis, a table was created with one row for each clock 
cycle of the program and one column for each of the eight functional units (and the two 
cross paths). If a functional unit executes an instruction during the clock cycle, its box is 
marked as shown below. This table is most useful to identify free functional units to which 
inserted instructions can be allocated. For example, if we need to multiply a value by four, 
we could use a multiply unit (if we have a register containing the value “4” and we can 
wait an extra clock cycle), or we could add it to itself (in L, S, or D units) twice, or shift it 
left two posititions (in an S unit). The table shows which (if any) are free to use. Also, 
when we need to insert a new execution packet, we can quickly scan down the M unit col- 


umns to see where multiplier latency will complicate the problem. 
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Scheduled assembly-language source Arithmetic unit use summarized for 








code for four clock cycles. 23 clock cycles (including the four at 
left). 
PE ig a a ae S1_D1 L1 M1 $2 D2 L2 M2 X X 
mv -s2x al0, b10 
22.5] x x X xX X 
| | mv -Slx b3, a9 
|| [!b2] addaw .dl a5, 1, a5 23) xX x Xx Xx 
24, x x xX xX X X xX X 
; 23 25} x xX X xX X Xx XXX 
shr eSsZ b3, 16, b3 25.51 x xX xX xX xX xX 
| | shr -Ssl al0, 16, al10 261 x x x xX xX XXX 
| | mv .1i1 a6, al 
| | addaw .dl a5, al, ad 27 SS 
27.5) x x xX X x 
; 24 28; x x x xX X Xx X X 
add .12  b3, bl10, b1l 29 XX X X X X XXX 
| | sub -1i1 a9, al0, al2 30 x xX XxX X X X XXX 
| | sub2 .s2x bl, a8, bl 311 x x x x x 
| | add2 .slx bl, a8, a8 321 xl xl x x xx xx 
| | addaw .dl a5, al, ad 
; 25 33 x x 
ext .sl a8, 16, 18, a8 34) x X X 
| | shr -s2x a8, 18, b10 35 x x 
|| sub .12  b3, bl10, b12 36 x 
|| add ll a9, al0, a9 37| x X 
| | mpylh .mlx a12, b15, al0 38 
I | mpy .m2  b11, b15, b10 . 
|| ldw .d2 *b5++[b6],b10 39 





Figure 19. Spreadsheet Summary of Arithmetic Unit Usage. 


The spreadsheet also tabulates usage of the thirty-two general-purpose reg- 
isters, which is essential for minimizing the number of data transfers to and from memory. 


The following table illustrates the same instructions (and more) as in Fig. 19. 
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Figure 20. Spreadsheet Summary of Register Usage. 


Each instruction cycle occupies one row of the sheet. The text in each cell 
was added according to the following rules: 

1. All cells are initially blank; when the analysis is complete, no cells should be 
blank. 

2. If a register receives a value (loaded from memory or the result of an arithmetic 
operation) during the prior cycle (such that it can be used as an operand in the current 
cycle), put the name of the variable into the associated cell. If the 32-bit register contains a 
pair of 16-bit values, separate their names with a slash (‘7’). In cycle 28, a pair of 16-bit 
values labeled “‘x/yi0” appears in register B10, the result of the “LDW” instruction that is 
shown in Fig. 19 (cycle 25). 

3. If the operation is iterative (e.g., i++), put in the expression. In the top row of 
Fig. 20, register B5 is incremented in cycle 22.5. In cycles 25.5 through 27.5, register BS 
is incremented by four times the value in register BO. 

4. If the value in a register is an instruction operand (including memory write) dur- 
ing the current cycle, put an asterisk in the cell. Register B6 is used to auto-increment B5 


in cycles 25 through 27. If this value will not be used again, underline the cell. 
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5. If a register must preserve its value through the cycle, put a period in the cell. 
Register B11 receives a new value in cycle 27.5, preserves it through 28, and it is read in 
cycle 29. 

6. Between the time a cell is underlined (as in rule 4), and it receives a new value 
(rule 3), put a dash into the cell. This cell is free to be used for a new instruction. B10, 
B11, and B12 are free in cycle 25.5. 

7. If the status of a register is dependent on a condition flag, prefix the cell text 
with a question mark. Register B15 is conditionally loaded with a pair of sixteen-bit val- 
ues for cycle 29. 

It is also useful to have columns in the table to indicate load/store opera- 
tions, as shown at the right side of Fig. 20. 

Once the spreadsheet is completed for a baseline algorithm (e.g., as pro- 
duced by the compiler), the impact of changes during manual optimization can be 
assessed. For this project, we needed to make two changes to the FFT routine provided by 
TI: first, TI developed for a “‘little-endian” memory access configuration, while our pro- 
cessor board is configured for “big-endian” memory access, and secondly, we wanted to 
prevent arithmetic overflow by implementing distributed scaling as described in section 
IL.A.5. 

Applying the rules above to the ten-cycle inner loop of the algorithm 
showed that there were 22 register cells (out of 320) free to receive new values, and 17 
(out of 80) free functional unit cells. Each instruction inserted into the loop creates eight 
more free functional unit cells, and from none to six more free register cells (depending on 
where in the loop it is inserted). Inserting an instruction cycle creates a free register cell 
only when it is inserted below a row with one or more cells underlined according to rule 4, 
above, or already free according to rule 6. 

In the end, accomplishing the required two changes resulted in an inner 
loop that takes 15 more nanoseconds (clock cycles designated 22.5, 25.5, and 27.5) than 


the original code. 
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b. Programs which Write Programs 

Once we have decided to use tabulated values in one or more parts of a pro- 
gram, we are faced with the potentially tedious and error-prone task of creating the table. 
Our solution is to write a portable C program which generates, as its output, another C 


source-code file which can be compiled by the DSP development tools. 


D. DESIGN FEATURES OF THE PENTEK 4290A 

A processor chip by itself is useless. Pentek, Inc. integrates four C6x processors 
with several types of memory and input/output devices into the model 4290A VME-bus 
single-board computer [Ref. 13]. Each C6x on the board has exclusive access to 128k 32- 
bit words of Synchronous Burst Static RAM (SBSRAM). During block transfers, a new 
data word can be transferred on every CPU clock cycle (800 MB/sec), which is half the 
rate of the on-chip IDRAM. [Ref. 13: p.13]. Access latency, however, can cause a signifi- 
cant performance penalty, as shown in the performance measurements table below. 

Each C6x also has exclusive access to 4M 32-bit words of Synchronous Dynamic 
RAM (SDRAM). At best, SDRAM has half the transfer rate (400 MB/sec) of SBSRAM, 
but data transfers stall for 60 nsec (12 clock cycles) whenever a page boundary is crossed 
[Ref. 13: p.14]. Again, access latency effects can be dramatic. 

The 4290A also has FIFO, dual-port, and global memories for interprocessor com- 
munication, but they are irrelevant to this discussion. 
E. EXPERIMENTAL RESULTS FROM DSP PROGRAMS 

1. Test Method and Conditions 

In this test, we observe the impact of memory latency, for each of the three types of 
memory (IDRAM, SBSRAM, and SDRAM) on the Pentek 4290A, for two algorithms: 
application of the window function and the 4K FFT. Three data structures are involved: 
jifo_buf contains complex short-integer sampled data to be processed; /ft_buf is the work- 
ing memory of the FFT; w contains the tabulated FFT coefficients. 

As the first stage in spectrum estimation, we multiply each component of each 


complex element of the input data vector (in fifo_buf) by the corresponding element of the 
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tabulated window function (in the odd-indexed elements of w). To save space, we derive 
the VonHann window function “on the fly” from the tabulated FFT constants, as shown in 


Fig. 21. 


fifo_ptr = &fifo_buf[0]; /* Point to first input value (from A/D) .*/ 


win_ptr = éw[1]; /* Point to first cos() value. */ 

fft_ptr = &fft_buf[0]; /* Point to first output value (to FFT). */ 

for (ii=0; ii < 4096; iitt) { 
tmp = 16384 - (*win_ptr >> 1); /* Scale to avoid overflow. */ 
win_ptr += 2; /* Skip over sin() value to next cos(). */ 
*fft_ptr++ = (*fifo_ptrt++ * tmp) >> 16; /* Scale to restore ...*/ 
*fft_ptr++ = (*fifo_ptr++ * tmp) >> 16; /* ..fixed-point format. */ 


Figure 21. ANSI C Source Code for Window Function Algorithm. 


Then the 4K FFT algorithm (not including the data unscrambling phase) is applied 
to data in fft_buf. (Since Welch’s method for spectrum analysis involves the sum of peri- 
odograms, we unscramble the final result, rather than each raw FFT output, so the 
unscrambling algorithm is not included in these measurements.) 

2. Test Results 

The execution time of the window and FFT functions were measured using TI’s 


Code Composer Suite performance profiling timer, with results as shown in Table 4. 
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Table 4. Pentek DSP Memory Performance. 


























: clock time ratio to 5: 

function . conditions 
cycles | (Usec) | best time 

win 10,514 53 all (fifo_buf, fft_buf, and w) in 

fft 83,737 420 IDRAM 

win 117,079 585 fifo_buf in SBSRAM 

fft 80,659 403 

win 165,268 826 15.72 | fifo _buf in SDRAM 

fft 80,658 403 1.00 

win 63,768 w in SBSRAM 

fft 134,218 

win 87,335 w in SDRAM 

fft 245,585 

win 61,795 309 fft_buf in SBSRAM 

fft 714,198 3571 

win 78,633 fft_buf in SDRAM 

fft 1,434,998 

win 253,007 fft_buf and w in SDRAM 

fft 1,619,164 

win 657,522 3288 62.54 | allin SDRAM 

fft 1,622,313 8112 20.13 


This table clearly shows the importance of putting frequently used data into the 
Internal Data RAM of the processor. Though the SBSRAM is described as “zero wait- 
state” memory, since a new value can be read on every clock cycle, attempting to use it for 
the FFT’s data vector wastes over 88% of the processor’s performance due to access-time 
latency. With all three data structures in SDRAM, the processor spends 95% of its time 
waiting. 

The fastest configuration applied the window to 4096 complex data elements in 
only 10,514 cycles, or about 2.5 clock cycles per complex element. (Remember, reading 
from IDRAM takes five clock cycles, multiplication takes two, and branching back to the 


top of a loop takes six.) The C compiler exploits the parallelism of the window algorithm 
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(every data element is independent), but how does it perform against the complexity of a 
portable FFT routine? 

The “Stockham algorithm, radix-four, with tabulated constants, alternating work 
buffers, and integer arithmetic” (algorithm St-F of Chapter II, converted to integer arith- 
metic with internal scaling) was evaluated with Code Composer Suite’s simulator over a 
set of compiler-optimization switches. The portability of the function was not impaired by 
using intrinsic functions or pragmas. Execution times for 4K transforms are listed in Table 


5. 
Table 5. TI DSP Compiler Performance. 


optimization | cycles time optimization features 





none 2380975 | 11.9 msec | most easily debugged, due to the clear association of source 
and executable code, memory is up-to-date after each C 
source line has completed. 





-00: register 1390231 | 6.95 msec | simplifies control flow, assigns variables to registers, sim- 
plifies expressions and statements, etc. 


-o1: local 960201 4.80 msec | adds local copy propagation, eliminates local common 
expressions, etc. 





-o2: function 270106 1.35 msec | adds conversion of array references to incrementing point- 
ers, loop unrolling, loop optimizations, software pipelin- 
ing., etc. 





-03: file 270106 1.35 msec | makes short functions in-line, propagates argument into 
function when function argument always has the same 
value, etc. 

















Comparing the best compiler-optimized portable routine (1350 usec) against the 
best hand-optimized routine (403 sec), we see that the compiled routine takes more than 
three times as long. Inspection of the scheduled assembly language produced by the com- 
piler shows that at most six of the eight functional units are coded into a single execution 
cycle (and then, in only one of the 182 cycles in the function), and two cycles use five 
units. For the rest of the program, at least half of the functional units are idle. In the inner 
loop, 94 instructions will execute in 49 cycles, so we average less than two instructions per 
cycle. There is, therefore, a strong incentive to manually optimize time-critical portions of 


an application or incorporate a well-optimized library routine. 


50 


F. COMPARING PENTIUM-III AND C6X 

In Chapter II-D we saw that the Pentium-III could compute a floating-point 4K 
FFT (with properly sorted output) in 320 usec, while in section III-E, the C6201 took 403 
usec to do an integer calculation (without sorting the output). The Pentium-III was faster 
and easier to program. So why should we consider DSP devices? 

CPU cost: the TI DSP chip currently sells for about $40 (in 1000-unit quantities) 
[Ref. 13], while 1 GHz Pentium-compatible AMD Athlon 4 processors cost $425 (in 
1000-unit quantities) [Ref. 14]. 

System cost: The cost of the DSP device includes on-chip serial ports, DMA con- 
trollers, and memory controllers, which would be external to the Pentium. On the other 
hand, the volume of Pentium-system sales provides economies of scale which are lacking 
in the DSP-on- VMEbus market. 

Physical size: Our DSP system puts four processors on each VME card, while our 
Pentium system packages just one CPUs per card. 

State of the art: the clock speed of C6x family devices has increased by a factor of 
three since the C6201 was introduced, while Pentium speeds approach 1500 MHz, so a 
current DSP chip may outperform a current Pentium. However, Pentium-class processors 
reach the commercial marketplace much more quickly than DSP devices do (Pentek’s 
Model 4290 Cox board is still plagued with functional and/or documentation faults), so 
low-volume DSP developers may always lag behind. 

G. LARGE TRANSFORMS ON DSP 

The Internal Data RAM of the C6201 will not be large enough to store the coeffi- 
cient tables and data vectors for transforms of greater than 4K elements. To avoid the 
severe performance penalty of fetching data elements one at a time from external memory 
(as illustrated in Table 4), the C6201’s internal Direct Memory Access (DMA) controllers 
can be used to swap work-vectors in and out of IDRAM, much as the cache controllers 
described in Chapter II do for general-purpose processors. The DMA controllers, however, 
can be programmed with arbitrary “stride” values for reading and writing, which allows 


them to perform the transpose operations without burdening the processor. It may be pos- 


51 


sible for DMA transfers to take place concurrently with FFT calculations, without slowing 


the FFTs, but the actual effectiveness of this technique remains to be determined. 
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IV. CONCLUSION 


This thesis has explored a variety of issues involved in developing an FFT-based 
spectrum analyzer. We’ve shown how a popular “textbook” algorithm for the FFT can be 
modified to run faster on a general-purpose computer, and how main-memory access can 
become a bottleneck which limits the performance of the processor on large (greater than 
64K) FFTs. An algorithm (program “Fact-T” of the appendix) was presented which recov- 
ers much of the lost performance by improving the effectiveness of the cache manage- 


ment. For a 4K transform, the modified routine runs in half the time of the original; for a 


1K? transform, one seventh. 

Several options for spectrum analysis window function design and storage were 
assessed with results that must be considered in an application context. The Von Hann 
window was found to provide acceptable spectrum estimates for our application, while 
allowing reuse of trigonometric factors previously tabulated for the FFT itself. Decimation 
and interpolation of window function data was found to produce spurious features in the 
spectrum which diminish for large tables and higher-order interpolation algorithms. 

We implemented the FFT on a fixed-point digital signal processor, the 
TMS320C6201, taking advantage of assembly language software development tools 
which allow parallel functional units to be optimally exploited. An algorithm for sorting 
FFT results into natural order was developed which is more memory efficient than an 
algorithm described in TI application notes. 

Implementation of manually scheduled assembly language programs was shown to 
be facilitated by describing the resources and dependencies of the program in a general- 
purpose spreadsheet tool. 

Run-time measurements on an MIPS R4000 RISC processor, Intel Pentium-3, and 
TI TMS320C6201 illustrate the advances in architecture which allow the Pentium-3 and 
‘C6201 to provide performance which goes beyond that which would be predicted based 


solely on their shortened clock cycles. For a portable C 4K FFT, the Pentium requires less 


than half as many clock cycles as the R4000; for a 1K? FFT, superior cache management 
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gives it a four to one advantage per clock cycle. The 200 MHz C6201 running TI-opti- 
mized code is essentially equal in performance to the 733 MHz Pentium-3 running Intel- 
optimized code (at the 4K size), having a 3.5 to one advantage per clock cycle but a 3.7 to 
one handicap in clock speed. Intel has recently announced that the Pentium clock speed 
may double, while TI is promising 600 MHz versions of its product line, so the perfor- 
mance race may remain close for the foreseeable future. Considerations other than arith- 
metic performance, such as size, cost, power consumption, arithmetic precision, and 
designer familiarity outweigh their relative performance on generic benchmark programs. 
Determining whether or not a DSP device can satisfy a challenging performance require- 
ment is likely to require substantial investment to implement a realistic test, and coordina- 
tion with the manufacturer to determine component availability. 


A 4K FFT can run without reference to off-chip memory, which is demonstrated to 


be essential for efficient operation. For larger transforms (e.g., 1K2, 4K?), the C6201 must 
rely on DMA transfer of intermediate results between internal and external memories to 
maintain performance anywhere close to small transform performance. The large-FFT 
portable program “Fact-T” illustrates the sequence of block data movements and arith- 


metic. 
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APPENDIX: PORTABLE SOURCE CODE 


A. Str4flip: An out-of-place radix-four small FFT algorithm 
/ * 


Stockham Radix-four FFT (unit-stride, not in-place) 

based on Algorithm 2.4.2, Stockham Radix-four Algorithm, in 
Computational Frameworks for the Fast Fourier Transform, SIAM, 
by Charles Van Loan. 





This version uses tabulated twiddle factors, for speed optimization. 
It also flip-flops samples between buff and data, to avoid copying. 
If the base-4 log of the FFT size is odd, 

then the result is returned in the workspace “buff,” 

else the result is returned in the input array “data.” 


xf: 

#include <stdio.h> 

#include <math.h> 

/* M_PI is defined in math.h */ 

#define MAIN 1 

/* 
The Stockham version has unit (complex) stride for each pass; data is 
in order before and after transform, but the computation is not 
“in-place.” 
“buff” is a workspace equal in size to “data.” 

* yf 

/* 
Initialize the trig-factor table. As usual, nn is the number of complex 
samples in the FFT, so the table has nn/4 complex (= nn/2 float) val- 

ues. */ 

void init_table( float w_tab[], const int nn ) 


{ 





int JJ; 

double temp; 

for (j33=0; jj < nn / 4; Jytt) { 
temp = 2.0 * M_PI * (double) jj / (double) nn; 
w_tab[2*jj ] = (float) cos( temp ); 
w_tab[2*jj + 1] = (float) -sin( temp ); 


void str4flip( float data[], float buff[], float w_tab[], 
const int nn, const int log4n) 


int j, k, gq, xr, r_star, ell, ell_star; 

float wr, wi, w2r, w2i, w3r, w3i; 

float .ary aly br, bi, er, sel; dr; di: 

float “t0K, 601, tir; tht, ber, 2a) b38r, ES 
float. *di ptr; *b ptr, *swapaptr; 
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int index; 
int w_index; 


d_ptr = data; /* Pre-swap the pointers. */ 
b_ptr = buff; 
for (q = 1; q <= log4n; qt+) { /* For each pass through the data */ 
ell = (int) pow( 4.0, (double) q ); 
r =nn / ell; 
ell_star = ell / 4; 
Gusta ed. * are 
/* Swap data/buff pointers; effectively swapping the input and 
output buffers used below. */ 
swap_ptr = d_ptr; 
d_ptr = b_ptr; 
b_ptr = swap_ptr; 





for (j=0; 3 < ell_star; j++) { 





w_index = 2 * j * xr; 

wr = w_tab[ w_index++ ]; 

wi = w_tab[ w_index lie 

w2r = wr * wr - wi * wi; w2i = wr * wi + wi * wr; 
w3r = wr * w2r - wi * w2i; w3i = wr * w2i + wi * w2r; 


for (k=0; k < r; k++) { 
index = 2*(j * r_star + k); 











































































































ar = *(b_ptr + index ); 

ai = *(b_ptr + index + 1); 

index += 2 * r; 

br = wr * *(b_ptr + index ) - wi * *( b_ptr + index + 1 
bi = wi * *(b_ptr + index ) + wr * *( b_ptr + index + 1 
index += 2 * r; 

cr w2r * *(b_ptr + index ) - w2i * *( b_ptr + index + 1 
Ci = w2i-* * (b=ptr index ) w2r * *( b_ ptr + index + 1 
index += 2 * r; 

dr w3r * *(b_ptr + index ) - w3i * *( b_ptr + index + 1 
di = w3i * *(b_ptr index ) w3r * *( b_ptr + index + 1 
tOr = ar + cr;t0Oi = ai + ci; 

tin = ar= -cr tli, =-at-= Ci; 

t2y = br + .dret2a (= bi + edi: 

t3r = br - dr;t3i = bi - di; 

index = 2 * (j * r + k); 

*(d_ptr index )=t0r t2r;*(d_ptr + indext+1) = t0i+t 
index += 2 * r * ell_star; 

*(d- ptr index )=tir t3i;*(d_ptr + indext+l1) = tli t 
index += 2 * r * ell_star; 

*(d_ptr index )=t0Or t2r;* (d_ptr + indext+l) = t0i t 
index += 2 * r * ell_star; 

*(d_ptr index )=tlr E319 * (d_ptr index+l) =tli+t 
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#if (MAIN) /* turn on for stand-alone testing. 
#define LEN (1024*16) 

/* Note that the base-4 log is half of the base-2 log. 
#define LOGLEN (5+2) 

#define TESTLOOPS 1 














void main ( 


{ 


void ) 














float data[ LEN * 2 ], work[ LEN * 2 J], 
THE, pk? 
/* Synthesize some input data. */ 

for (j = 0; 3 < LEN; j++) { 
data[2*j] = 0.0; /* cos( j * M_PI / 2.0 
data[2*j+1] = 0.0; /* sin( j * M_PI / 2.0 

} 

data[3] = 1.0f; 

init_table( w_tab, LEN ); 

for (k = 0; k < TESTLOOPS; k++) { 








/* Since str4flip does not scale its result, 


str4flip to a buffer quickly leads to 

exceptions. Re-create the data vector 

Make sure loops are not identical, or 

looping! */ 
data[2] = k; 
str4flip( data, 
/* 
} 








work, 
Examine results 


w_tab, Ll 
(not shown). 


EN, 
ef 


LOGLEN 











} 
#endif /* MAIN switch, 
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w_tab[ LEN / 2 


for stand-alone testing. 


*/ 


*/ 





l; 


recursive application of 
numeric overflow processor 
to test with many loops. 
optimizer may prevent 


i 


ay 


B. Cxtranspose: Efficiently transpose an array of complex elements 


/* Implements an out-of-place transpose of a rectangular complex data 
array. The dimensions of the array must be integer multiples of 
BLOCK_DIM, below. For a 1024x1024 array, the time required to perform the 
transpose is 0.6 sec (with 32x32 block size). A straightforward algorithm 
took 2.1 sec.*/ 





include <stdio.h> 

/* This version transposes square blocks, to improve cache performance. 
BLOCK_DIM of 16 and 32 are about equal, but 64 is worse. */ 

define BLOCK_DIM 16 

void cxtranspose( float in_data[], float out_datal[], 

const int rows, const int cols ) 





int ii, jj, kk, mm;/* General loop counters. */ 
float *in_ptr[BLOCK_DIM];/* Array of input pointers. */ 
float *out_ptr[BLOCK_DIM];/* Array of output pointers. */ 
/* In_ptrs point to the first columns of the first N rows. */ 
for (kk = 0; kk < BLOCK_DIM; kk++) 





in_ptr[ kk ] = &in_data[ 2 * cols * kk ]; 
/* For each block-row... */ 
for (jj = 0; 434 < (rows / BLOCK_DIM); jj++ ) { 
/* Out_ptrs point to left edge of current block-column.*/ 
out_ptr[0] = é&0ut_data[2 * BLOCK_DIM * jj]; 
for (mm = 1; mm < BLOCK_DIM; mmt++) 
out_ptr[ mm ] = out_ptr[ mm-l1 ] + 2 * rows; 
/* For each block in the block-row... */ 
for (ii=0; ii < (cols / BLOCK_DIM); iitt+) { 
/* For each row in the block... */ 
for (kk=0; kk < BLOCK_DIM; kk++) { 
/* For each column in the block... */ 





for (mm=0; mm < BLOCK_DIM; mmt++ ) { 

/* Copy data from one row pointer to each column pointer. */ 
*out_ptr[mm] *in_ptr[kk]++; /* Real part. */ 
*out_ptr [mm] *in_ptr[kk]++; /* Imag part. */ 





























} 








/* Done with a block; move the out_ptrs down to the next block in 
the block-column. (The in_ptrs just increment into the next 
block in the block-row.) */ 

out_ptr[0] += 2 * BLOCK_DIM * (rows - 1); 
for (mm = 1; mm < BLOCK_DIM; mmt++) 
out_ptr[ mm ] = out_ptr[ mm-l1 ] + 2 * rows; 








/* Done with a block-row; move in_ptrs to next block-row. */ 
for (kk = 0; kk < BLOCK_DIM; kk++) 
in_ptr[ kk ] += 2 * cols * (BLOCK_DIM - 1); 
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C. Fact-T: A large FFT algorithm 


/* 
This program builds on the str4flip Stockham radix-four algorithm to 
build a large FFT, using the ideas in Section 3.3 of Van Loan. 


Note that Van Loan’s terminology assumes Fortran’s array organization, 
where successive rows within a column are adjacent in memory. In this 
ANSI-C program, successive elements are assumed to lie in the same row. 
*f 

#include <stdio.h> 

#include <math.h> 

/* M_PI is defined in math.h */ 





/* Declare the FFT and complex-transpose external routines. */ 
extern void str4flip( float * data, float * work, 

const int length, const int log4_length ); 
extern void cxtranspose( float * in_data, float * out_data, 
const int rows, const int cols ); 

define LEN1 256 

define LEN2 4096 

define LEN (LEN1 * LEN2) 

define MAXLEN LEN2 /* whichever is bigger */ 

define LOGLEN1 4 

define LOGLEN2 6 

define TESTLOOPS 2 

































































/* 
Initialize the block-multiply twiddle-factor array. Note that the sign 
on the “sin” term is negative, consistent with Van Loan’s text. 

*f 





void init_twiddle( float tw[], const int lenl, const int len2 ) 
{ 

int ii, jj, index; 

double temp; 


temp = 2.0 * M_PI / ((double) (lenl * len2)); 





for (ii = O; ii < lenl; iit+ ) { 
for (jj = 0; 33 < len2; jjtt+ ) { 
index = ii * len2 + jj; 
tw[ 2 * index ] = tcos( temp * ii * jj ); 
tw[ 2 * index + 1] = -sin( temp * ii * jj ); 
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/* Apply the block-multiply constants. * 


void use_twiddle ( 


len2 ) 
{ 


float datal[], 


























/ 


float tw[], 

















const int lenl, 


; /* two floats per complex */ 


he greater of LEN1 and LEN2. 








int ii, jj, index; 
float temp; 
for (ii = O; ii < lenl; iit+ ) { 
for (jj = 0; 33 < len2; jjtt+ ) { 
index = 2 * (ii * len2 + j9); 
temp = data[ index ] 
data[ index + 1 ] * tw[ index + 1 ]; 
data[ index + 1] = data[ index + 1 ] 
data[ index ]: tw aides: i].s 
data[ index ] = temp; 
} 
} 
} 
void main( void ) 
{ 
float datal LEN * 2 ] 
float data2[ LEN * 2 ]; /* transposed */ 
float twiddle[ LEN * 2 ]; 
float work[ MAXLEN * 2 ]; /* T 
int j, k; 
init_twiddle( twiddle, LEN1, LEN2 ); /* opera 
for (k = 0; k < TESTLOOPS; k++) { 
/* Read (or synthesize) test data 


/* Now, 


cxtranspo 
for (j = 


se( da 
OGy 2S 


str4flip( é&da 


} 

use_twidd 
cxtranspo 
for (j = 


le( da 
se( da 
OF a Ss 


str4flip( é&da 


} 
cxtranspo 
/* Output 


se( da 
(or Cc 








compute the transform. 














a 











heck) 














ta, data2, LE 








result 
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ta, data2, LEN2, LENI1 ); 
LEN1; j++) { 

ta2[ j * 2 * LEN2 ], 
ta2, twiddle, LEN1, LEN2 
ta2, data, LEN1, LEN2 ); 
LEN2; j++) { 

ta[j * 2 * LEN1], wor 





(deleted) */ 





work, LEN2, LOGLEN2 








k, LEN1, LOGLENI ); 











const int 


tes on transpose */ 
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